1 Goal

Describe variation in spawn timing and how it relates to environmental covariates.

2 Study Area and species

This study was conducted in the Middle Fork of the Salmon River (MFSR) in central Idaho (Fig. 1).

Map of the Middle Fork Salmon River (MFSR) study area showing redd locations used in the analysis (2002-2005) and stream reaches.
Map of the Middle Fork Salmon River (MFSR) study area showing redd locations used in the analysis (2002-2005) and stream reaches.

3 Data prep and inspection

3.1 Spawn timing data

Spawn timing data for Chinook salmon were collected from 2001 to 2005 in the MFSR. We removed data from 2001, and data from Knapp Creek and Cape Horn Creek, as these sites were not consistently sampled.

We consider the day of year a redd is assumed to be complete as our response variable (spawn timing).

Each redd location was assigned a COMID based on the NHD, which is analogous to a stream reach. The COMID is used to link the spawn timing data with covariate data associated with the stream reach on which it is located.

# Response data ---------------
spawn_data <- read_csv(here("data", "russ_spawn", "mfsr_spawn_cleaned.csv"))

# remove bad data
spawn_data <- spawn_data |>
  filter(stream != "Knapp" & stream != "Cape Horn" & year != 2001) |> 
  mutate(year = as.factor(year), stream = as.factor(stream), COMID = as.factor(COMID))

glimpse(spawn_data)
## Rows: 3,016
## Columns: 18
## $ UNIQUE_ID  <dbl> 1295, 1296, 1298, 1299, 1300, 1301, 1302, 1303, 1304, 1305,…
## $ COMID      <fct> 23519365, 23519365, 23519319, 23519319, 23519319, 23519317,…
## $ REACH      <chr> "BEARVAL1", "BEARVAL1", "BEARVAL1", "BEARVAL1", "BEARVAL1",…
## $ GNIS_NAME  <chr> "Bear Valley Creek", "Bear Valley Creek", "Bear Valley Cree…
## $ OBS_CLASS  <chr> "MONITOR", "MONITOR", "MONITOR", "MONITOR", "MONITOR", "MON…
## $ OBS_NAME   <chr> "RN", "RN", "RN", "RN", "RN", "RN", "RN", "RN", "RN", "RN",…
## $ DATE       <chr> "08232002", "08232002", "08232002", "08232002", "08232002",…
## $ FEATURES   <chr> "26", "27", "23", "24", "25", "22", "57", "71", "56", "72",…
## $ Feature_ID <chr> "26", "27", "23", "24", "25", "22", "57", "71", "56", "72",…
## $ OBJ_CLASS  <chr> "Redd", "Redd", "Redd", "Redd", "Redd", "Redd", "Redd", "Re…
## $ EASTING    <dbl> 629674, 629674, 629696, 629696, 629696, 629731, 629731, 629…
## $ NORTHING   <dbl> 4918590, 4918590, 4918716, 4918716, 4918716, 4918801, 49188…
## $ Comments   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ date       <date> 2002-08-23, 2002-08-23, 2002-08-23, 2002-08-23, 2002-08-23…
## $ year       <fct> 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002,…
## $ yday       <dbl> 235, 235, 235, 235, 235, 235, 238, 246, 238, 246, 246, 238,…
## $ month      <dbl> 8, 8, 8, 8, 8, 8, 8, 9, 8, 9, 9, 8, 8, 9, 9, 8, 9, 9, 9, 9,…
## $ stream     <fct> Bear Valley, Bear Valley, Bear Valley, Bear Valley, Bear Va…
# length(unique(spawn_data$COMID))  # 104
# length(unique(spawn_data$stream))  # 8
# length(unique(spawn_data$year))  # 4

The data comprise 3016 redd observation from 8 streams across 4 years. The redds were observed between day 206 and 263.

3.2 Covariates

To test for environmental factors driving variation in spawn timing, we quantified associations with metrics describing thermal and physical conditions in stream reaches.

We selected covariates based on the following criteria: (1) they are known to influence spawn timing, (2) they are available for all streams, and (3) they are not highly correlated with each other.

Our focal independent variable were the average daily flow (cfs) and stream temperature (°C) during the 30, 60, and 90 days prior to spawning. We also included the slope of the stream reach and the mean elevation of the stream reach.

# make a df of just response and covariates
model_data <- combined_data |>
  filter(SLOPE < .2) |> # bad slope data
  select(
    yday, COMID, stream, year,
    flow_30, flow_60, flow_90,
    temp_30, temp_60, temp_90,
    SLOPE, mean_elevation
  )
glimpse(model_data)
## Rows: 3,013
## Columns: 12
## $ yday           <dbl> 235, 235, 235, 235, 235, 235, 238, 246, 238, 246, 246, …
## $ COMID          <fct> 23519365, 23519365, 23519319, 23519319, 23519319, 23519…
## $ stream         <fct> Bear Valley, Bear Valley, Bear Valley, Bear Valley, Bea…
## $ year           <fct> 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2…
## $ flow_30        <dbl> 650.8333, 650.8333, 650.8333, 650.8333, 650.8333, 650.8…
## $ flow_60        <dbl> 997.4333, 997.4333, 997.4333, 997.4333, 997.4333, 997.4…
## $ flow_90        <dbl> 1992.733, 1992.733, 1992.733, 1992.733, 1992.733, 1992.…
## $ temp_30        <dbl> 12.70318, 12.70318, 14.02703, 14.02703, 14.02703, 13.78…
## $ temp_60        <dbl> 12.87280, 12.87280, 14.20337, 14.20337, 14.20337, 13.95…
## $ temp_90        <dbl> 11.07830, 11.07830, 12.15472, 12.15472, 12.15472, 11.92…
## $ SLOPE          <dbl> 0.00299158, 0.00299158, 0.00171806, 0.00171806, 0.00171…
## $ mean_elevation <dbl> 1951.730, 1951.730, 1946.735, 1946.735, 1946.735, 1944.…

4 Descriptive stats and visualizations

We evaluated variation in spawn timing using histograms, boxplots, density plots. We also calculated summary statistics for spawn timing.

4.1 Spawn time variation

  • mean and median are equal, low SD
  • var < mean (no overdispersion) data are right-skewed, lumpy/multimodal, and not symmetric at least
  • Poisson family is response is count of spawning events per day, Gaussian if model density as continuous

4.2 Spawn time variation by stream

4.3 Cumulative proportions of redds by stream

5 Bivariate relationships with covariates

  • temp_30 = no relationship, drop
  • temp_60 = good non-linear, drop for temp_90?
  • temp_90 = better non-linear
  • flow_30 and flow_60 = similar decaying exponential
  • flow_90 = inflections, interesting grouping really spreads out, year effect
  • SLOPE = no relationship, drop
  • mean_elevation = slightly negative linear?

6 Check for colinearity

model_data |> 
  select(temp_30, temp_60, temp_90, flow_30, flow_60, flow_90, mean_elevation) |> 
  GGally::ggpairs()

  • There is strong colinearity among temp variable. Retaining temp_90 as it has the strongest relationship.
  • flow_30 and flow_60 are highly correlated with flow_90. We will keep flow_90 as it has the strongest relationship.
  • elevation good to keep

Check VIFs with and without flow_30 and flow_60:

car::vif(lm(yday ~ flow_30 + flow_60 + flow_90 + temp_90 + mean_elevation + stream + year, data = model_data))
##                    GVIF Df GVIF^(1/(2*Df))
## flow_30        54.18575  1        7.361097
## flow_60        42.25228  1        6.500176
## flow_90        13.47150  1        3.670354
## temp_90        15.37376  1        3.920938
## mean_elevation 21.09913  1        4.593379
## stream         64.19481  7        1.346192
## year           54.26889  3        1.945771
car::vif(lm(yday ~ flow_90 + temp_90 + mean_elevation + stream + year, data = model_data))
##                     GVIF Df GVIF^(1/(2*Df))
## flow_90         9.605784  1        3.099320
## temp_90        14.146143  1        3.761136
## mean_elevation 20.418174  1        4.518647
## stream         53.393667  7        1.328594
## year            8.676807  3        1.433486
  • more reason to drop flow 30 and 90

7 Closer look at covariates

7.1 Temp_90

  • clear positive relationship, certainly some non-linearity
  • stream- and year-level variation (interactions)

Check simple models for temp to examine functional structure:

m1 <- lm(yday ~ temp_90, data = model_data)
m2 <- lm(yday ~ temp_90 * stream, data = model_data)
m3 <- lm(yday ~ temp_90 * stream + poly(temp_90, 2), data = model_data)
m4 <- lm(yday ~ poly(temp_90, 2), data = model_data)
m5 <- gam(yday ~ s(temp_90, k = 5), data = model_data, method = "REML")
m6 <- gam(yday ~ s(temp_90, by = stream, k = 5), data = model_data, method = "REML")

AIC(m1, m2, m3, m4, m5, m6) |> 
  arrange(AIC) |> 
  mutate(delta = AIC - min(AIC)) |>
  kableExtra::kbl() |> 
  kableExtra::kable_styling("striped", full_width = F)
df AIC delta
m6 32.383524 15894.98 0.0000
m3 18.000000 16192.42 297.4428
m2 17.000000 16547.05 652.0709
m5 5.985751 17629.35 1734.3790
m4 4.000000 17640.10 1745.1201
m1 3.000000 17983.29 2088.3110
m11 <- lm(yday ~ temp_90, data = model_data)
m22 <- lm(yday ~ temp_90 * year, data = model_data)
m33 <- lm(yday ~ temp_90 * year + poly(temp_90, 2), data = model_data)
m44 <- lm(yday ~ poly(temp_90, 2), data = model_data)
m55 <- gam(yday ~ s(temp_90, k = 5), data = model_data, method = "REML")
m66 <- gam(yday ~ s(temp_90, by = year, k = 5), data = model_data, method = "REML")

AIC(m11, m22, m33, m44, m55, m66) |> 
  arrange(AIC) |> 
  mutate(delta = AIC - min(AIC)) |>
  kableExtra::kbl() |> 
  kableExtra::kable_styling("striped", full_width = F)
df AIC delta
m33 10.000000 17283.11 0.0000
m66 17.942638 17327.23 44.1178
m22 9.000000 17459.41 176.2954
m55 5.985751 17629.35 346.2407
m44 4.000000 17640.10 356.9817
m11 3.000000 17983.29 700.1726
  • linear model is bad, quadratic is better, but GAM is best based on AIC
  • GAM is overfitting, and the quadratic model is flexible enough to capture the non-linearity

7.2 Flow_90

  • clear negative relationship, higher 90-day mean flows are associated with earlier spawn timing
  • relationship differs by year: 2003 linear but flattening or non-linear at higher flows
  • suggest different slopes or curves across years (year specific responses, but just intercepts)
  • stream-level variation as well, different intercepts and slopes
  • this is really a year effect with variation by stream and comid (local)

Check simple models for flow to examine functional structure:

# compare linear, quadriatic, and gam model
m1 <- lm(yday ~ flow_90, data = model_data)
m2 <- lm(yday ~ flow_90 * year, data = model_data)
m3 <- lm(yday ~ flow_90 * year + poly(flow_90, 2), data = model_data)
m4 <- lm(yday ~ poly(flow_90, 2), data = model_data)
m5 <- gam(yday ~ s(flow_90, k = 5), data = model_data, method = "REML")
m6 <- gam(yday ~ s(flow_90, by = year, k = 5), data = model_data, method = "REML")

AIC(m1, m2, m3, m4, m5, m6) |> 
  arrange(AIC) |> 
  kableExtra::kbl() |> 
  kableExtra::kable_styling("striped", full_width = F) |> 
  kableExtra::add_header_above(c(" " = 1, "AIC" = 2))
AIC
df AIC
m6 17.999597 7964.960
m3 10.000000 9654.838
m2 9.000000 10061.171
m5 5.998777 18324.766
m4 4.000000 18455.969
m1 3.000000 18662.693
# sjPlot::tab_model(m1, m2, m3, m4, m5, m6)
# summary(m2)

# prediction grid
pred_data <- model_data %>%
  group_by(year) %>%
  summarize(min_flow = min(flow_90), max_flow = max(flow_90)) %>%
  rowwise() %>%
  do(data.frame(year = .$year,
                flow_90 = seq(.$min_flow, .$max_flow, length.out = 100))) %>%
  ungroup()

# Predict
pred_data$yday_pred_m2 <- predict(m2, newdata = pred_data)
pred_data$yday_pred_m3 <- predict(m3, newdata = pred_data)
pred_data$yday_pred_m6 <- predict(m6, newdata = pred_data)

model_data |> 
  ggplot(aes(x = flow_90, y = yday, color = year)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "grey1") +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, color = "grey50") +
  geom_smooth(method = "gam", formula = y ~ s(x, k = 5), method.args = list(method = "REML"), se = FALSE, color = "grey90") +
  geom_line(data = pred_data, aes(x = flow_90, y = yday_pred_m2), size = 1.2) +
  geom_line(data = pred_data, aes(x = flow_90, y = yday_pred_m3), size = 1.2) +
  geom_line(data = pred_data, aes(x = flow_90, y = yday_pred_m6), size = 1.2) +
  theme_custom() +
  labs(title = "Model fits", 
       subtitle = "",
       y = "Day of Year",
       x = "Flow (90-day mean pre-spawn)")

  • linear model is bad, quadratic is better, but GAM is best based on AIC
  • GAM is defintely overfitting becuase each year has a narrow flow range, only 2003 has high flows
  • poly fit with year intereaction is best non-GAM fit, but still probably overfitted

7.2.1 To include or not to include flow_90?

Including flow_90 could introduce spurious precision and possibly overfitting. Why flow_90 might be problematic:

  1. Not spatially resolved
  • We are modeling spawn timing at the redd level (COMID/stream)
  • But flow_90 is calculated from a single downstream gauge, and applied to all redds
  • This assumes flow conditions are identical across all sites, which is rarely true in a branching stream network
  • Including it gives the illusion of spatially resolved variation that isn’t there
  1. Likely correlated with year
  • Since flow_90 varies mostly across years, (albeit slightly with streams), it is strongly confounded with year
  • Any flow-related signal is probably already captured by your year fixed effect
  • Including both flow_90 and year risks collinearity, and may produce misleading inferences
  1. Spawn-time aligned flow ≠ experienced flow
  • While flow_90 is aligned to each redd’s spawn date, it still reflects a lower-basin gauge, not the actual hydrologic conditions experienced at the redd site
  • So it might be precisely wrong — aligned in time but irrelevant in space

Bottom Line: Unless we have site-specific or spatially disaggregated flow data, flow_90 is probably not a valid covariate for redd-level models.

Including it may:

  • Overfit due to noise or pseudo-replication
  • Complicate interpretation (e.g., why one stream “responds” to flows measured elsewhere?)
  • Mask true year or site effects

Recommendation: Drop flow_90 from model (or at most, keep it as a year-level covariate if you summarize it to a single annual value for all observations)

Although we initially considered including 90-day mean streamflow (flow_90) as a predictor of spawn timing, this variable was ultimately excluded due to concerns about ecological validity and model overfitting. Streamflow data were derived from a single downstream USGS gauge and did not capture spatial variation across the study streams or reaches. Moreover, because flow_90 was closely aligned with year, it introduced strong collinearity with the year effect and risked attributing site-level variation to flow patterns not actually experienced by individual redds. As such, we excluded flow_90 to avoid misleading inference.

8 Scale and final dataset

Final dataset includes:

  • response: spawn time (doy), continuous
  • grouping variables: comid, stream, year (but not using COMID due to sparse data)
  • temp_90, flow_90, elevation
# select variables and standardize continuous covariates
scale2 <- function(x, na.rm = FALSE) (x - mean(x, na.rm = na.rm)) / sd(x, na.rm)

# model_data_final <- model_data |>
#   select(yday, COMID, stream, year, temp_90, mean_elevation) 

model_data_final <- model_data |>
  select(yday, COMID, stream, year, temp_90, mean_elevation) |>
  mutate(across(c("temp_90", "mean_elevation"), scale2))

9 Model specification

9.1 Fixed and random effects

We should only include a grouping variable as both fixef and ranef when you want to model differenct aspects of the effect. E.g., stream as fixed to compared average effects by stream, but as ranefs to account for non-independence of obs within each stream but not to compare average effects. Include as both when you want population-level estimates AND when you expect stream-level random variation around those means. Do not include both when intercepts are redundant (~ stream + (1| stream)), or too few levels for the random effect to be estimated (e.g., <5 levels).

Variable Fixed? Random? Why
COMID No Yes (but must check data availability among levels; likely sparse) Not estimating individual COMID effects, just accounting for correlation.
Stream Yes Maybe (only if random slope or complex structure) May want to estimate differences and account for grouping.
Year Yes (comparing years) Maybe (account for inter-annual variability) Use one or the other depending on goal.

In this case, it makes sense to include year as fixed because:

  • only has 4 levels - so a ranef would estimate variance poorly and shrink aggressively
  • year captures unmeasured inter annual variability (snow pack, flow, temp anomalies all wrapped in)
  • we want to compare among years given we only have 4 years of data, hard to extrapolate

9.2 Overfitting

On our first couple passes, we used models with renefs, but were observing strong overfitting. Here’s what’s likely going wrong with the mixed-effects model:

High AIC-based model performance but stream-level predictions far from observed data , so random effects absorbing noise or overfitting sparse groups. Random intercepts for COMID dominate variance becuase many COMIDs have only 1 observation → intercepts become noise. Unequal data across years and streams means random intercepts get misled by imbalance, especially for sparse years (like 2005). Random slopes and interactions fail to improve fit because not enough data per group to support slope variation.

A simple lm may be better here because it treats stream and year as explicit, estimable fixed effects, which gives you real, interpretable estimates for group means. It doesn’t try to “guess” partial-pooling intercepts or slopes where data are lacking. The model becomes transparent — predictions reflect what’s in the data, not what the model infers from structure.

Let’s interrogate the grouping structure within the data to make a final decision.

9.3 Grouping structure of data

9.3.1 COMIDs per stream, and observations per COMID

Number of comids per stream
stream n_COMIDs
Big 21
Loon 19
Camas 16
Marsh 13
Bear Valley 11
Sulphur 11
Beaver 7
Elk 6

There are at least 6 COMIDs per stream, and the number of observations per COMID is highly variable. The histogram shows that most COMIDs have 1-2 observations, but some have many more. 23 COMIDs have fewer than 5 observations.

9.3.2 Summary

Are the enough COMIDs per stream to consider stream/COMID nested random effects? No.

Are groups well sampled? (Do most COMIDs have >1-2 redds?) No. 23 COMIDs have <5 redds (25%), 13 have <= 2. (12.5%) With <5 obs/level, variance estimates become unstable -> overfit and absorb noise (low AIC / high R2) -> singular fits.

Are year or stream-level random effects justifiable? We want to compare streams and years in this dataset, not generalize beyond them, so we should use fixed effects. Further, stream are known, of interest, and were not randomly sampled. Including (1 | stream) would be statistically redundant (stream intercepts would be double-modeled) and would lead to misleading AIC comparisons.

Random slopes?. The needs multiple obs per group across a range of slope variable (temp_90), enough replication to estimate variation in slopes, not just intercepts. Need ~8+ obs per group spread across the covariate. We could do this for stream or year. Prob not.

So, we should use a simple linear model with no random effects. This allows for interpretable results, no overfitting from random effects, and is a reasonable approach given the design.

If we did try (1 | COMID), expect singularity, low variance estimates for COMID, and predictions will likley miss the groups means.

10 Model fitting and comparison

10.1 Evaluate COMID as random intercept

##         df      AIC    delta
## m_rand  14 13541.11    0.000
## m_fixed 13 16245.22 2704.107

Does adding (1|COMID) improve it without overfiitg? Improves AIC by >2700. Is the variance explained by COMID non-zero and meaningful? Variance is 60 and likely meaninful. But, is it overfitting? Yes, the model is overfitting. The AIC is much lower, but the predictions are not close to the observed data (Not shown).

10.2 Final candidate covariates and interactions

  • temp_90: 90-day mean temperature
  • mean_elevation: mean elevation of the stream reach
  • stream: stream name
  • year: year of observation
Interaction Description Possible Interpretation
temp_90 × stream Temperature effects on spawn timing vary across streams Some streams may be more thermally sensitive (e.g., due to groundwater, shading, etc.)
temp_90 × year Interannual differences in how temperature affects spawn timing Annual climate conditions (e.g., snowmelt timing, baseflow) modify temp-timing response
stream × year Stream-specific interannual variation Some streams respond more strongly to wet/dry or hot/cool years
mean_elevation × stream Elevation effects on timing vary by stream Elevation proxies snowpack or gradient, which may matter differently by stream
mean_elevation × year Elevation effects change across years Snowmelt timing shifts may be more impactful in some years
temp_90 × mean_elevation Thermal effects depend on elevation High-elevation streams may respond more strongly or weakly to the same temp range

The most plausible core interactions to include a prior are: - temp_90 × stream - temp_90 × year

Others like stream x year are data hungry and may overfit without strong replication. Interactions with mean_elevation does not make much sense.

We used AIC-based model selection to identify the best-supported linear model that included temp_90, mean_elevation, stream, year, and all two-way interactions. This model outperformed all simpler alternatives by a large AIC margin. We then tested an alternative model including a raw quadratic term (I(temp_90^2)) to account for potential nonlinearity in the temperature–spawn timing relationship. Model performance was evaluated using AIC and visual assessment of stream- and year-specific predictions.

10.3 Simple LM

First we’ll fit simple linear models to get a sense of the data and the relationships. Adding new interactions each time.

# m0 <- lmer(yday ~ temp_90 + stream + year + (1 | COMID), data = model_data_final)
m1 <- lm(yday ~ temp_90 + mean_elevation + stream + year, data = model_data_final)
m2 <- lm(yday ~ temp_90 * stream + temp_90 + mean_elevation + stream + year, data = model_data_final)
m3 <- lm(yday ~ temp_90 * stream + temp_90 * year + temp_90 + mean_elevation + stream + year, data = model_data_final)
m4 <- lm(yday ~ temp_90 * stream + temp_90 * year + stream * year + temp_90 + mean_elevation + stream + year, data = model_data_final)

AIC(m1, m2, m3, m4) |> 
  arrange(AIC) |> 
  mutate(delta = AIC - min(AIC)) 
##    df      AIC     delta
## m4 43 14657.31    0.0000
## m3 24 15022.20  364.8860
## m2 21 15242.00  584.6876
## m1 14 15933.83 1276.5133

Full interaction (m4) is best. We will now compare all possible combinations of the interactions, except stream x year.

10.4 Dredge

# global_model <- lm(yday ~ temp_90 * stream + temp_90 * year + stream * year + temp_90 + mean_elevation + stream + year, data = model_data_final)
global_model <- lm(yday ~ temp_90 * stream + temp_90 * year + temp_90 + stream + year + mean_elevation, data = model_data_final)

# Run dredge
options(na.action = "na.fail")  # Required for dredge
model_set <- dredge(global_model, trace = 1, rank = "AIC")

model_set

Model 64 with all 2-way interactions (save stream x year) is the best supported model based on AIC.

NOTE ELEVATION IS THE ISSUE CAUSING OFFSETS!!!! The best fitting model without elevation in the full model with all 2-way interactions. When plotting predictions using the model without elevation, the linear offsets are no more. This makes sense because the estimated coefficients for models with elevation are positive, which does not align with the observed data:

final_model <- get.models(model_set, 1)[[1]]
tmp <- ggeffects::ggpredict(final_model, terms = c("mean_elevation"))

ggplot(tmp, aes(x = x, y = predicted)) +
  geom_line(size = 1.1) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2, color = NA) +
  geom_point(data = model_data_final, aes(x = mean_elevation, y = yday), size = .5) + 
  theme_custom() + 
  labs(title = "Elevation effect on spawn timing",
       subtitle = "Note the strong mismatch between predicted trend and observed data",
       x = "Mean Elevation (m)", y = "Spawn Day of Year") 

So we will select the best fitting model without elevation (model 127; AIC rank = 6)

11 Diagnostics

final_model <- get.models(model_set, 4)[[1]]

12 Predictions

  • Some stream-specific fits still under perform, especially at extremes
  • Year-specific predictions are systematically low (in both linear and poly models)
  • The orthogonal poly(temp_90, 2) didn’t resolve this and may be too constrained
  • Raw quadratic (I(temp_90^2)) might help fine-tune curvature per stream or year

13 Quadratic model

Next we will fit a new best + quadratic model.

##             df      AIC    delta
## lm_poly     24 15521.32   0.0000
## final_model 23 15731.60 210.2749

The poly model improves AIC by ~210, major improvement. So the curvature is helping.

  • The quadratic model fits the data much better, especially at the extremes
  • The stream-specific fits are much better, but the year-specific fits are still systematically low (NOT ANY MORE - removed elevation)

13.1 Compare linear and quadratic model plots

  • Dashed vs solid lines show differences between models
  • Improved fit near the tails or mid-range = quadratic model success

14 Model fit comparison

14.1 Calculate RMSE

# get preds for each model
model_data_final$pred_linear <- predict(final_model, newdata = model_data_final)
model_data_final$pred_quad   <- predict(lm_poly, newdata = model_data_final)

# compute error metrics by stream and year
model_error_summary <- model_data_final %>%
  mutate(error_linear = abs(yday - pred_linear),
         error_quad   = abs(yday - pred_quad)) %>%
  group_by(stream, year) %>%
  summarize(
    n = n(),
    MAE_linear = mean(error_linear),
    MAE_quad   = mean(error_quad),
    RMSE_linear = sqrt(mean((yday - pred_linear)^2)),
    RMSE_quad   = sqrt(mean((yday - pred_quad)^2)),
    .groups = "drop"
  )

# Long format
model_error_long <- model_error_summary %>%
  pivot_longer(cols = starts_with("RMSE"), names_to = "model", values_to = "RMSE") %>%
  mutate(model = ifelse(model == "RMSE_linear", "Linear", "Quadratic"))

14.2 Stream level RMSE comparison

14.3 RMSE difference by stream

14.4 RMSE difference by stream and year

  • facets show stream-level RMSE differences
  • patterns across years may highlight where curvature helps (e.g., early, warm, or dry years)